Method and system for the training of parameters of a pattern recognition system, each parameter being associated with exactly one realization variant of a pattern from an inventory

ABSTRACT

The invention relates to a method of training parameters of a pattern recognition system, each parameter being associated with exactly one realization variant of a pattern from an inventory, comprising the steps of:  
     making available a training set of patterns, and  
     determining the parameters through discriminative optimization of a target function, and to a system for carrying out the above method.

[0001] Method and system for the training of parameters of a pattern recognition system, each parameter being associated with exactly one realization variant of a pattern from an inventory

[0002] The invention relates to a method and a system for the training of parameters of a pattern recognition system, each parameter being associated with exactly one realization variant of a pattern from an inventory, and in particular to a method and a system for the training of parameters of a speech recognition system which are each associated with exactly one pronunciation variant of a word from a vocabulary.

[0003] Pattern recognition systems, and in particular speech recognition systems, are used for a large number of applications. Examples are automatic telephone information systems such as, for example, the flight information service of the German air carrier Lufthansa, automatic dictation systems such as, for example, FreeSpeech of the Philips Company, handwriting recognition systems such as the automatic address recognition system used by the German Postal Services, and biometrical systems which are often proposed for personal identification, for example for the recognition of fingerprints, the iris, or faces. Such pattern recognition systems may in particular also be used as components of more general pattern processing systems, as is evidenced by the example of personal identification mentioned above.

[0004] Many known systems use statistical methods for comparing unknown test patterns with reference patterns known to the system for the recognition of these test patterns. The reference patterns are characterized by means of suitable parameters, and the parameters are stored in the pattern recognition system. Thus, for example, many pattern recognition systems use a vocabulary of single words as the recognition units, which are subsequently subdivided into so-termed sub-word units for an acoustical comparison with an unknown spoken utterance. These “words” may be words in the linguistic sense, but it is usual in speech recognition to interpret the notion “word” more widely. In a spelling application, for example, a single letter may constitute a word, while other systems use syllables or statistically determined fragments of linguistic words as words for the purpose of their recognition vocabularies.

[0005] The problem in automatic speech recognition lies inter alia in the fact that words may be pronounced very differently. Such differences arise on the one hand between different speakers, may follow from a speaker's state of mind, or are influenced by the dialect used by the speaker in the articulation of the word. On the other hand, very frequent words may in particular be spoken with a different sound sequence in spontaneous speech as compared with the sequence typical of carefully read-aloud speech. Thus, for example, it is usual to shorten the pronunciation of words: “would” may become “'d” and “can” may become “c'n”.

[0006] Many systems use so-termed pronunciation variants for modeling different pronunciations of one and the same word. If, for example, the l^(th) word w_(l) of a vocabulary V can be pronounced in different ways, the j^(th) manner of pronunciation of this word may be modeled through the introduction of a pronunciation variant v_(lj). The pronunciation variant v_(lj) is then composed of those sub-word units which fit the j^(th) manner of pronunciation of w_(l). Phonemes, which model the elementary sounds of a language, may be used as the sub-word units for forming the pronunciation variants. However, statistically derived sub-word units are also used. So-termed Hidden Markov Models are often used as the lowest level of acoustical modeling.

[0007] The concept of a pronunciation variant of a word as used in speech recognition was clarified above, but this concept may be applied in a similar manner to the realization variant of a pattern from an inventory of a pattern recognition system. The words from a vocabulary in a speech recognition system correspond to the patterns from the inventory, i.e. the recognition units, in a pattern recognition system. Just as words may be pronounced differently, so may the patterns from the inventory be realized in different ways. Words may thus be written differently manually and on a typewriter, and a given facial expression such as, for example, a smile, may be differently constituted in dependence on the individual and the situation. The considerations of the invention are accordingly applicable to the training of parameters associated with exactly one realization variant of a pattern from an inventory in a general pattern recognition system, although for reasons of economy they are disclosed in the present document mainly with reference to a speech recognition system.

[0008] As was noted above, many pattern recognition systems compare an unknown test pattern with the reference patterns stored in their inventories so as to determine whether the test pattern corresponds to any, and if so, to which reference pattern. The reference patterns are for this purpose provided with suitable parameters, and the parameters are stored in the pattern recognition system. Pattern recognition systems based in particular on statistical methods then calculate scores indicating how well a reference pattern matches a test pattern and subsequently attempt to find the reference pattern with the highest possible score, which will then be output as the recognition result for the test pattern. Following such a general procedure, scores will be obtained in accordance with pronunciation variants used, indicating how well a spoken utterance matches a pronunciation variant and how well the pronunciation variant matches a word, i.e. in the latter case a score as to whether a speaker has pronounced the word in accordance with this pronunciation variant.

[0009] Many speech recognition systems use as their scores quantities which are closely related to probability models. This may be constituted as follows, for example: it is the task of the speech recognition system to find for a spoken utterance x that word sequence ŵ₁ ^(N)=(ŵ₁, ŵ₂, . . . , ŵ_(N)) of N words, N being unknown, which of all possible word sequences w₁ ^(N)′ with all possible lengths N′ optimally matches the spoken utterance x, i.e. having the highest conditional probability in view of the condition x:
$$\hat{w}_1^N = \arg\max_{w_1^{N'}} \; p\left(w_1^{N'} \mid x\right). \qquad (1)$$

[0010] Applying the Bayes' theorem yields a known model partition:
$$\hat{w}_1^N = \arg\max_{w_1^{N'}} \; p\left(x \mid w_1^{N'}\right) \cdot p\left(w_1^{N'}\right). \qquad (2)$$

[0011] The possible pronunciation variants v₁ ^(N)′ associated with the word sequence w₁ ^(N)′ can be introduced by summation:
$$p\left(x \mid w_1^{N'}\right) = \sum_{v_1^{N'}} p\left(x \mid v_1^{N'}\right) \cdot p\left(v_1^{N'} \mid w_1^{N'}\right), \qquad (3)$$

[0012] because it is assumed that the dependence of the spoken utterance x on the pronunciation variant v₁ ^(N)′ and the word sequence w₁ ^(N)′ is defined exclusively by the sequence of pronunciation variants v₁ ^(N)′.

[0013] For further modeling of the dependence p(v₁ ^(N)′|w₁ ^(N)′), a so-termed unigram assumption is usually made, which disregards context influences:
$$p\left(v_1^{N'} \mid w_1^{N'}\right) = \prod_{i=1}^{N'} p\left(v_i \mid w_i\right). \qquad (4)$$

[0014] If the l^(th) word of the vocabulary V of the speech recognition system is denoted w_(l), the j^(th) pronunciation variant of this word is denoted v_(lj), and the frequency with which the pronunciation variant v_(lj) occurs in the sequence of pronunciation variants v₁ ^(N)′ is denoted h_(lj)(v₁ ^(N)′) (for example, the frequency of the pronunciation variant “cuppa” in the utterance “give me a cuppa coffee” is 1, but that of the pronunciation variant “cup of” is 0), then the latter expression may also be written:
$$p\left(v_1^{N'} \mid w_1^{N'}\right) = \prod_{l=1}^{D} \left[ p\left(v_{lj} \mid w_l\right) \right]^{h_{lj}\left(v_1^{N'}\right)}, \qquad (5)$$

[0015] in which the product is now formed for all D words of the vocabulary V.
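By way of illustration only, the equivalence of equations (4) and (5) may be sketched in Python as follows. The function names, the dictionary layout and the probability values are assumptions made for this example; the multi-word unit “cup of” is treated as a single vocabulary entry, and the sum in the second function runs over all variants (l, j) that actually occur, which is how the product of equation (5) is read here.

```python
import math
from collections import Counter

def log_p_positionwise(words, variants, p):
    """Equation (4): sum of log p(v_i | w_i) over all positions i."""
    return sum(math.log(p[(w, v)]) for w, v in zip(words, variants))

def log_p_from_counts(words, variants, p):
    """Equation (5): sum of h_lj * log p(v_lj | w_l) over the variants that occur."""
    h = Counter(zip(words, variants))  # h[(w_l, v_lj)] = frequency in the sequence
    return sum(count * math.log(p[key]) for key, count in h.items())

# Hypothetical variant probabilities p(v_lj | w_l), keyed by (word, variant).
p = {
    ("give", "give"): 1.0,
    ("me", "me"): 1.0,
    ("a", "a"): 1.0,
    ("cup of", "cup of"): 0.7,
    ("cup of", "cuppa"): 0.3,
    ("coffee", "coffee"): 1.0,
}
words = ["give", "me", "a", "cup of", "coffee"]
variants = ["give", "me", "a", "cuppa", "coffee"]

score_positions = log_p_positionwise(words, variants, p)
score_counts = log_p_from_counts(words, variants, p)
```

Both functions return the same value for the same sequence; the count-based form is the one that later leads to a log-linear expression in the variant parameters.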

[0016] The quantities p(v_(lj)|w_(l)), i.e. the conditional probabilities that the pronunciation variant v_(lj) is spoken for the word w_(l), are parameters of the speech recognition system which are each associated with exactly one pronunciation variant of a word from the vocabulary in this case. They are estimated in a suitable manner in the course of the training of the speech recognition system by means of a training set of spoken utterances available in the form of acoustical speech signals, and their estimated values are introduced into the scores of the recognition alternatives in the process of recognition of unknown test patterns on the basis of the above formulas.

[0017] Where the probability procedure usual in pattern recognition was used in the above discussion, it will be obvious to those skilled in the art that general evaluation functions are usually applied in practice which do not fulfill the conditions of a probability. Thus, for example, the normalization condition is often not regarded as one that must be fulfilled, or instead of a probability p, a quantity p^(λ) exponentially modified with a parameter λ is often used. Many systems also operate with the negative logarithms of these quantities: −λ log p, which are then often regarded as the “scores”. When probabilities are mentioned in the present document, accordingly, the more general evaluation functions familiar to those skilled in the art are also deemed to be included in this term.

[0018] Training of the parameters p(v_(lj)|w_(l)) of a speech recognition system, which are each associated with exactly one pronunciation variant v_(lj) of a word w_(l) from a vocabulary, involves the use of a “maximum likelihood” method in many speech recognition systems. It can thus be determined, for example, in the training set how often the respective variants v_(lj) of the word w_(l) are pronounced. The relative frequencies ƒ_(rel)(v_(lj)|w_(l)) observed from the training set then serve, for example, directly as estimated values for the parameters p(v_(lj)|w_(l)) or alternatively are first subjected to known statistical smoothing operations such as, for example, discounting.
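A minimal Python sketch of such a relative-frequency estimate is given below. The function name, the data layout and the example pronunciations are assumptions for illustration. The optional smoothing shown is one simple form of absolute discounting (a fixed amount between 0 and 1 is subtracted from every non-zero count and the freed mass is spread uniformly over the variants of the word); it stands in for whatever smoothing operation is actually used.

```python
from collections import Counter

def estimate_variant_probs(observed, lexicon, discount=0.0):
    """Relative frequencies f_rel(v_lj | w_l), optionally with absolute discounting.

    observed: list of (word, variant) pairs seen in the training set
    lexicon:  dict mapping each word to the list of its variants
    discount: value in [0, 1] subtracted from every non-zero count
    """
    counts = Counter(observed)
    probs = {}
    for word, variants in lexicon.items():
        total = sum(counts[(word, v)] for v in variants)
        if total == 0:
            # Word never observed: fall back to a uniform distribution (assumption).
            for v in variants:
                probs[(word, v)] = 1.0 / len(variants)
            continue
        seen = sum(1 for v in variants if counts[(word, v)] > 0)
        freed_mass = discount * seen / total
        for v in variants:
            c = counts[(word, v)]
            probs[(word, v)] = max(c - discount, 0.0) / total + freed_mass / len(variants)
    return probs

# Hypothetical lexicon entry and training observations.
lexicon = {"can": ["k ae n", "k n", "k ah n"]}
observed = [("can", "k ae n")] * 3 + [("can", "k n")]
ml = estimate_variant_probs(observed, lexicon)
smoothed = estimate_variant_probs(observed, lexicon, discount=0.5)
```

Without discounting, the unseen variant receives probability 0; with discounting it receives a small positive probability while the estimates for the word still sum to 1.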

[0019] U.S. Pat. No. 6,076,053 by contrast discloses a method by which the pronunciation variants of a word from a vocabulary are merged into a pronunciation network structure. The arcs of such a pronunciation network structure consist of the sub-word units, for example phonemes in the form of HMMs (“sub-word (phoneme) HMMs assigned to the specific arc”), of the pronunciation variants. To answer the question whether a certain pronunciation variant v_(lj) of a word w_(l) from the vocabulary was spoken, weight multiplicative, weight additive, and phone duration dependent weight parameters are introduced at the level of the arcs of the pronunciation network, or alternatively at the sub-level of the HMM states of the arcs.

[0020] In the method proposed in U.S. Pat. No. 6,076,053, the scores p(v_(lj)|w_(l)) are not used. Instead, in using the weight parameters e.g. at the arc level, a score ρ_(j) ^((k)) is assigned to arc j in the pronunciation network for the k^(th) word, ρ_(j) ^((k)) being for example a (negative) logarithm of the probability (“In arc level weighting an arc j is assigned a score ρ_(j) ^((k)). In a presently preferred embodiment, this score is a logarithm of the likelihood.”). This score is subsequently modified with a weight parameter (“Applying arc level weighting leads to a modified score g_(j) ^((k)): g_(j) ^((k))=u_(j) ^((k))·ρ_(j) ^((k))+c_(j) ^((k))”). The weight parameters themselves are determined by discriminative training, for example through minimizing of the classification error rate in a training set (“optimizing the parameters using a minimum classification error criterion that maximizes a discrimination between different pronunciation networks”).

[0021] The invention has for its object to provide a method and a system for the training of parameters of a pattern recognition system, each parameter being associated with exactly one realization variant of a pattern from an inventory, and in particular to provide a method and a system for the training of parameters of a speech recognition system which are each associated with exactly one pronunciation variant of a word from a vocabulary, wherein the pattern recognition system is given a high degree of accuracy in the recognition of unknown test patterns.

[0022] This object is achieved by means of a method of training parameters of a pattern recognition system, each parameter being associated with exactly one realization variant of a pattern from an inventory, which method comprises the steps of:

[0023] making available a training set of patterns, and

[0024] determining the parameters through discriminative optimization of a target function,

[0025] and by means of a system for the training of parameters of a pattern recognition system, each parameter being associated with exactly one realization variant of a pattern from an inventory, which system is designed for:

[0026] making available a training set of patterns, and

[0027] determining the parameters through discriminative optimization of a target function,

[0028] and in particular by means of a method of training parameters of a speech recognition system, each parameter being associated with exactly one pronunciation variant of a word from a vocabulary, which method comprises the steps of:

[0029] making available a training set of acoustical speech signals, and

[0030] determining the parameters through discriminative optimization of a target function,

[0031] as well as by means of a system for the training of parameters of a speech recognition system, each parameter being associated with exactly one pronunciation variant of a word from a vocabulary, which system is designed for:

[0032] making available a training set of acoustical speech signals, and

[0033] determining the parameters through discriminative optimization of a target function.

[0034] The dependent claims 2 to 5 relate to advantageous further embodiments of the invention. They relate to the form in which the parameters are assigned to the scores p(v_(lj)|w_(l)), the details of the target function, the nature of the various scores, and the method of optimizing the target function.

[0035] In claims 9 and 10, however, the invention relates to the parameters themselves which were trained by a method as claimed in claim 7 as well as to any data carriers on which such parameters are stored.

These and further aspects of the invention will be explained in more detail below with reference to embodiments and the appended drawing, in which:

[0036] FIG. 1 shows an embodiment of a system according to the invention for the training of parameters of a speech recognition system which are each associated with exactly one pronunciation variant of a word from a vocabulary, and

[0037] FIG. 2 shows the embodiment of a method according to the invention for the training of parameters of a speech recognition system which are each associated with exactly one pronunciation variant of a word from a vocabulary in the form of a flowchart.

[0038] The parameters p(v_(lj)|w_(l)) of a speech recognition system which are associated with exactly one pronunciation variant v_(lj) of a word w_(l) from a vocabulary may be directly fed to a discriminative optimization of a target function. Eligible target functions are inter alia the sentence error rate, i.e. the proportion of spoken utterances recognized erroneously (minimum classification error), and the word error rate, i.e. the proportion of words recognized erroneously. Since these are discrete functions, those skilled in the art will usually apply smoothed versions instead of the actual error rates. Available optimization procedures, for example for minimizing a smoothed error rate, are gradient procedures, inter alia the “generalized probabilistic descent (GPD)”, as well as all other procedures for non-linear optimization such as, for example, the simplex method.

[0039] In a preferred embodiment of the invention, however, the optimization problem is brought into a form which renders possible the use of methods of discriminative model combination. The discriminative model combination is a general method known from WO 99/31654 for the formation of log-linear combinations of individual models and for the discriminative optimization of their weight factors. Accordingly, WO 99/31654 is hereby included in the present application by reference so as to avoid a repeat description of the methods of discriminative model combination.

[0040] The scores p(v_(lj)|w_(l)) are not themselves directly used as parameters in the implementation of the methods of discriminative model combination, but instead they are represented in exponential form with new parameters λ_(lj):

$$p\left(v_{lj} \mid w_l\right) = e^{\lambda_{lj}} \qquad (6)$$

[0041] Whereas the parameters λ_(lj) in the known methods of non-linear optimization can be used directly for optimizing the target function, the discriminative model combination aims to achieve a log-linear form of the model scores p(w₁ ^(N)|x). For this purpose, the sum of equation (3) is limited to its main contribution in an approximation:

$$p\left(x \mid w_1^{N'}\right) = p\left(x \mid \tilde{v}_1^{N'}\right) \cdot p\left(\tilde{v}_1^{N'} \mid w_1^{N'}\right) \qquad (7)$$

[0042] with
$$\tilde{v}_1^{N'} = \arg\max_{v_1^{N'}} \; p\left(x \mid v_1^{N'}\right) \cdot p\left(v_1^{N'} \mid w_1^{N'}\right). \qquad (8)$$

[0043] Taking into consideration the Bayes' theorem mentioned above (cf. equation 2) and the equations (5) and (7), the desired log-linear expression is found:
$$\log p_\Lambda\left(w_1^N \mid x\right) = -\log Z_\Lambda(x) + \lambda_1 \log p\left(w_1^N\right) + \lambda_2 \log p\left(x \mid \tilde{v}_1^N\right) + \sum_{l=1}^{D} \lambda_{lj}\, h_{lj}\left(\tilde{v}_1^N\right) \qquad (9)$$
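The intermediate steps leading to equation (9) may be spelled out as follows. This is a sketch of the reasoning only, with the weights of the first two terms taken as 1 for the moment and with every term that does not depend on the word sequence absorbed into the normalization quantity:

```latex
\begin{align*}
p(w_1^N \mid x)
  &= \frac{p(x \mid w_1^N)\, p(w_1^N)}{p(x)}
  && \text{(Bayes' theorem, cf. (2))} \\
  &\approx \frac{p(x \mid \tilde{v}_1^N)\, p(\tilde{v}_1^N \mid w_1^N)\, p(w_1^N)}{p(x)}
  && \text{(by (7))} \\
\log p(\tilde{v}_1^N \mid w_1^N)
  &= \sum_{l=1}^{D} h_{lj}(\tilde{v}_1^N)\, \log p(v_{lj} \mid w_l)
   = \sum_{l=1}^{D} \lambda_{lj}\, h_{lj}(\tilde{v}_1^N)
  && \text{(by (5) and (6))}
\end{align*}
```

Taking the logarithm of the second line and inserting the third line gives the three word-sequence-dependent summands of equation (9); the term log p(x) plays the role of the normalization term.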

[0044] To clarify the dependencies of the individual terms on the parameters Λ=(λ₁, λ₂, . . . , λ_(lj), . . . ) to be optimized, Λ was introduced as an index in the relevant locations. Furthermore, as is usual in discriminative model combination, the other two summands log p(w₁ ^(N)) and log p(x|ṽ₁ ^(N)) were also provided with suitable parameters λ₁ and λ₂. These, however, need not necessarily be optimized, but may be chosen to be equal to 1: λ₁=λ₂=1. Nevertheless, their optimization typically does lead to an improved quality of the speech recognition system. The quantity Z_(Λ)(x) depends only on the spoken utterance x (and the parameters Λ) and serves only for normalization, in as far as it is desirable to interpret the score p_(Λ)(w₁ ^(N)|x) as a probability model; i.e. Z_(Λ)(x) is determined such that the normalization condition
$$\sum_{w_1^N} p_\Lambda\left(w_1^N \mid x\right) = 1$$

[0045] is complied with.
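The log-linear score of equation (9) and its normalization may be sketched in Python as follows. The names and the numerical values are assumptions for illustration. In particular, the sum over all word sequences in the normalization condition is replaced here by a sum over a finite list of hypotheses (for example an N-best list), which is an approximation chosen for this example and not something prescribed above.

```python
import math

def normalized_log_scores(hyps, lam1, lam2, lam_var):
    """Log-linear scores of equation (9), normalized over the given hypotheses.

    hyps: list of dicts with
        'log_lm': log p(w_1^N)
        'log_ac': log p(x | v~_1^N)
        'counts': {variant id: h_lj(v~_1^N)}
    lam_var: {variant id: lambda_lj}
    """
    raw = [
        lam1 * h["log_lm"] + lam2 * h["log_ac"]
        + sum(lam_var[v] * c for v, c in h["counts"].items())
        for h in hyps
    ]
    top = max(raw)  # log-sum-exp with the maximum factored out for stability
    log_z = top + math.log(sum(math.exp(r - top) for r in raw))
    return [r - log_z for r in raw]

# Hypothetical parameters and two competing hypotheses for one utterance.
lam_var = {"cuppa": math.log(0.3), "cup of": math.log(0.7)}
hyps = [
    {"log_lm": -4.0, "log_ac": -120.0, "counts": {"cuppa": 1}},
    {"log_lm": -3.5, "log_ac": -123.0, "counts": {"cup of": 1}},
]
log_post = normalized_log_scores(hyps, 1.0, 1.0, lam_var)
```

The exponentials of the returned values sum to 1 over the hypothesis list, so that they can be read as the scores p_(Λ)(k|x) restricted to that list.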

[0046] The discriminative model combination utilizes inter alia various forms of smoothed word error rates determined during training as target functions. For this purpose, the training set should consist of the H spoken utterances x_(n), n=1, . . . , H. Each such utterance x_(n) has a spoken word sequence ^((n))w₁ ^(L_(n)) with a length L_(n) assigned to it, referred to here as the word sequence k_(n) for simplicity's sake. k_(n) need not necessarily be the actually spoken word sequence; in the case of the so-termed unmonitored adaptation k_(n) would be determined, for example, by means of a preliminary recognition step. Furthermore, a set ^((n))k_(i), i=1, . . . , K_(n) of K_(n) further word sequences, which compete with the spoken word sequence k_(n) for the highest score in the recognition process, is determined for each utterance x_(n), for example by means of a recognition step which calculates a so-termed word graph or N-best list. These competing word sequences are denoted k≠k_(n) for the sake of simplicity, the symbol k being used as the generic symbol for k_(n) and k≠k_(n).

[0047] The speech recognition system determines the scores p_(Λ)(k_(n)|x_(n)) and p_(Λ)(k|x_(n)) for the word sequences k_(n) and k (≠k_(n)), indicating how well they match the spoken utterance x_(n). Since the speech recognition system chooses the word sequence k_(n) or k with the highest score as the recognition result, the word error E(Λ) is calculated as the Levenshtein distance Γ between the spoken (or assumed to have been spoken) word sequence k_(n) and the chosen word sequence:
$$E(\Lambda) = \frac{1}{\sum_{n=1}^{H} L_n} \sum_{n=1}^{H} \Gamma\left(k_n, \arg\max_{k} \left( \log \frac{p_\Lambda\left(k \mid x_n\right)}{p_\Lambda\left(k_n \mid x_n\right)} \right)\right) \qquad (10)$$
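The word error of equation (10) may be illustrated with the following Python sketch. The function names, the data layout and the scores are assumptions for this example; the candidate list of each utterance is taken to include the spoken word sequence itself, so that choosing the candidate with the highest log score corresponds to the arg max in equation (10).

```python
def levenshtein(ref, hyp):
    """Minimum number of word substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(training_set):
    """training_set: list of (reference words, candidates); each candidate is
    a (word list, log score) pair, and the reference is among the candidates."""
    total_len = sum(len(ref) for ref, _ in training_set)
    errors = 0
    for ref, candidates in training_set:
        best_words, _ = max(candidates, key=lambda c: c[1])
        errors += levenshtein(ref, best_words)
    return errors / total_len

# Hypothetical training set with two utterances and made-up log scores.
ref1 = "give me a cup of coffee".split()
ref2 = "thank you".split()
training_set = [
    (ref1, [(ref1, -10.0), ("give me a cup coffee".split(), -9.0)]),
    (ref2, [(ref2, -2.0), ("thank shoe".split(), -5.0)]),
]
wer = word_error_rate(training_set)
```

In this example the first utterance is misrecognized with one word error and the second is recognized correctly, giving one error over eight reference words.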

[0048] This word error rate is smoothed into a continuous function E_(S)(Λ) capable of differentiation by means of an “indicator function” S(k,n,Λ):
$$E_S(\Lambda) = \frac{1}{\sum_{n=1}^{H} L_n} \sum_{n=1}^{H} \sum_{k \neq k_n} \Gamma\left(k_n, k\right) S\left(k, n, \Lambda\right). \qquad (11)$$

[0049] The indicator function S(k,n,Λ) should be close to 1 for the word sequence with the highest score chosen by the speech recognition system, whereas it should be close to 0 for all other word sequences. A possible choice is:
$$S\left(k, n, \Lambda\right) = \frac{p_\Lambda\left(k \mid x_n\right)^{\eta}}{\sum_{k'} p_\Lambda\left(k' \mid x_n\right)^{\eta}} \qquad (12)$$

[0050] with a suitable constant η, which may be chosen to be 1 in the simplest case.
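A possible Python sketch of the indicator function of equation (12) and of the smoothed error of equation (11) follows. The names, the data layout and the numbers are assumptions made for illustration. The spoken word sequence is included in each hypothesis list with Levenshtein distance 0, so that it enters the denominator of equation (12) without contributing to the sum of equation (11).

```python
import math

def indicator(log_scores, eta=1.0):
    """Equation (12), computed from log scores log p(k | x_n)."""
    scaled = [eta * s for s in log_scores]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    norm = sum(weights)
    return [w / norm for w in weights]

def smoothed_error(training_set, eta=1.0):
    """Equation (11).

    training_set: list of (L_n, hyps); hyps is a list of
    (Levenshtein distance to k_n, log p(k | x_n)) pairs.
    """
    total_len = sum(length for length, _ in training_set)
    total = 0.0
    for _, hyps in training_set:
        s = indicator([score for _, score in hyps], eta)
        total += sum(dist * s_k for (dist, _), s_k in zip(hyps, s))
    return total / total_len

# Hypothetical data: two utterances of lengths 6 and 2 with two hypotheses each.
training_set = [
    (6, [(0, -10.0), (1, -9.0)]),
    (2, [(0, -2.0), (1, -5.0)]),
]
e_soft = smoothed_error(training_set, eta=1.0)
e_sharp = smoothed_error(training_set, eta=50.0)
```

With a large η the indicator concentrates on the highest-scoring hypothesis and the smoothed error approaches the unsmoothed word error of this data; with η = 1 the error mass is shared among the hypotheses.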

[0051] The target function of equation 11 may be optimized, for example, by means of an iterative gradient method, such that after carrying out the respective partial differentiations the following iterative equation for the parameters λ_(lj) of the pronunciation variants will be obtained by those skilled in the art:
$$\lambda_{lj}^{(I+1)} = \lambda_{lj}^{(I)} - \frac{\varepsilon \cdot \eta}{\sum_{n=1}^{H} L_n} \sum_{n=1}^{H} \sum_{k \neq k_n} S\left(k, n, \Lambda^{(I)}\right) \cdot \tilde{\Gamma}\left(k, n, \Lambda^{(I)}\right) \cdot \left[ h_{lj}\left(\tilde{v}(k)\right) - h_{lj}\left(\tilde{v}(k_n)\right) \right]. \qquad (13)$$

[0052] An iteration step with step width ε will yield the parameters λ_(lj) ^((I+1)) of the (I+1)^(th) iteration step from the parameters λ_(lj) ^((I)) of the I^(th) iteration step; ṽ(k) and ṽ(k_(n)) denote the pronunciation variants with the highest scores (in accordance with equation 8) for the word sequences k and k_(n), and Γ̃(k,n,Λ) is short for:
$$\tilde{\Gamma}\left(k, n, \Lambda\right) = \Gamma\left(k, k_n\right) - \sum_{k' \neq k_n} S\left(k', n, \Lambda\right) \Gamma\left(k', k_n\right). \qquad (14)$$

[0053] Since the quantity Γ̃(k,n,Λ) is the deviation of the error rate Γ(k,k_(n)) from the error rate of all word sequences weighted with S(k′,n,Λ), it is possible to characterize word sequences k with Γ̃(k,n,Λ)<0 as correct word sequences because they exhibit an error rate lower than the one weighted with S(k′,n,Λ). The iteration rule of equation 13 accordingly stipulates that the parameters λ_(lj), and thus the scores p(v_(lj)|w_(l)), are to be enlarged for those pronunciation variants v_(lj) which, judging from the spoken word sequence k_(n), occur frequently in correct word sequences, i.e. for which it holds that h_(lj)(ṽ(k))−h_(lj)(ṽ(k_(n)))>0 in correct word sequences. A similar rule applies to variants which occur only seldom in bad word sequences. On the other hand, the scores are to be lowered for variants which occur only seldom in good word sequences and frequently in bad ones. This interpretation is a good example of the advantageous effect of the invention.
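One step of the iteration rule of equations (13) and (14) may be sketched in Python as follows. The function names and the data layout are assumptions for this example, and the hypothesis scores are assumed to have been computed beforehand with the current parameters Λ^((I)); the sketch does not re-score the hypotheses itself.

```python
import math

def softmax(values):
    top = max(values)
    weights = [math.exp(v - top) for v in values]
    norm = sum(weights)
    return [w / norm for w in weights]

def update_lambdas(lam, training_set, eps=1.0, eta=1.0):
    """One iteration step of equation (13).

    training_set: list of utterances, each a dict with
      'ref_len':     L_n
      'ref':         (log score, variant counts) of the word sequence k_n
      'competitors': list of (Levenshtein distance to k_n, log score, variant counts)
    """
    total_len = sum(u["ref_len"] for u in training_set)
    grad = {key: 0.0 for key in lam}
    for u in training_set:
        ref_score, ref_counts = u["ref"]
        comps = u["competitors"]
        # S(k, n, Lambda) of equation (12); k_n is part of the denominator only.
        s = softmax([eta * ref_score] + [eta * score for _, score, _ in comps])[1:]
        mean_dist = sum(s_k * dist for s_k, (dist, _, _) in zip(s, comps))
        for s_k, (dist, _, counts) in zip(s, comps):
            gamma_tilde = dist - mean_dist  # equation (14)
            for key in lam:
                diff = counts.get(key, 0) - ref_counts.get(key, 0)
                grad[key] += s_k * gamma_tilde * diff
    step = eps * eta / total_len
    return {key: lam[key] - step * grad[key] for key in lam}

# Hypothetical case: the spoken sequence uses variant "A"; an equally scored
# competitor with two word errors uses variant "B" instead.
lam = {"A": 0.0, "B": 0.0}
training_set = [{
    "ref_len": 4,
    "ref": (-5.0, {"A": 1}),
    "competitors": [(2, -5.0, {"B": 1})],
}]
new_lam = update_lambdas(lam, training_set)
```

In this toy case the competitor is a bad word sequence (its Γ̃ is positive), so the parameter of the variant occurring only in the competitor is lowered and that of the variant occurring only in the spoken word sequence is raised, in line with the interpretation given above.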

[0054] FIG. 1 shows an embodiment of a system according to the invention for the training of parameters of a speech recognition system wherein exactly one pronunciation variant of a word is associated with a parameter. A method according to the invention for the training of parameters of a speech recognition system which are associated with exactly one pronunciation variant of a word is carried out on a computer 1 under the control of a program stored in a program memory 2. A microphone 3 serves to record spoken utterances, which are stored in a speech memory 4. It is alternatively possible for such spoken utterances to be transferred into the speech memory from other data carriers or via networks instead of through recording via the microphone 3.

[0055] Parameter memories 5 and 6 serve to store the parameters. It is assumed that in this embodiment an iterative optimization process of the kind discussed above is carried out. The parameter memory 5 then contains, for example, for the calculation of the (I+1)^(th) iteration step the parameters of the I^(th) iteration step known at that stage, while the parameter memory 6 receives the new parameters of the (I+1)^(th) iteration step. In the next stage, i.e. the (I+2)^(th) iteration step in this example, the parameter memories 5 and 6 will exchange roles.
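The alternating use of the two parameter memories may be pictured with the following Python sketch. The function name and the toy update rule are hypothetical; the sketch shows only the exchange of roles between a memory that is read and a memory that is written.

```python
def run_iterations(initial, compute_step, num_steps):
    """Alternate between two parameter memories, one read and one written."""
    memories = [dict(initial), dict(initial)]  # stand-ins for memories 5 and 6
    read = 0
    for _ in range(num_steps):
        write = 1 - read
        memories[write].clear()
        memories[write].update(compute_step(memories[read]))
        read = write  # the two memories exchange roles for the next step
    return memories[read]

# Toy update rule (assumption): halve every parameter in each step.
result = run_iterations({"a": 8.0}, lambda p: {k: v * 0.5 for k, v in p.items()}, 3)
```

Keeping the old and the new parameters apart in this way ensures that every term of an iteration step is computed from the parameters of the preceding step.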

[0056] A method according to the invention is carried out on a general-purpose computer 1 in this embodiment. This will usually contain the memories 2, 5, and 6 in one common arrangement, while the speech memory 4 is more likely to be situated in a central server which is accessible via a network. Alternatively, however, special hardware may be used for implementing the method, which hardware may be constructed such that the entire method or parts thereof can be carried out particularly quickly.

[0057] FIG. 2 shows the embodiment of a method according to the invention for the training of parameters of a speech recognition system which are each associated with exactly one pronunciation variant of a word from a vocabulary in the form of a flowchart. After the start block 101, in which general preparatory measures are taken, the start values Λ⁽⁰⁾ for the parameters are chosen in block 102, and the iteration counter variable I is set to 0: I=0. A “maximum likelihood” method as described above may be used for estimating the scores p(v_(lj)|w_(l)), from which the start values of λ_(lj) ⁽⁰⁾ are subsequently obtained through formation of the logarithm function.

[0058] Block 103 starts the progress through the training set of spoken utterances, for which the counter variable n is set to 1: n=1. The selection of the competing word sequences k≠k_(n) so as to match the spoken utterance x_(n) takes place in block 104. If the spoken word sequence k_(n) matching the spoken utterance x_(n) is not yet known from the training data, it may be estimated here by means of the updated parameter formation of the speech recognition system in block 104. It is also possible, however, to carry out such an estimation once only in advance, for example in block 102. Furthermore, a separate speech recognition system may alternatively be used for estimating the spoken word sequence k_(n).

[0059] In block 105, the progress through the set of competing word sequences k≠k_(n) is started, for which purpose the counter variable k is set to 1: k=1. The calculation of the individual terms and the accumulation of the double sum arising in equation 13 from the counter variables n and k take place in block 106. It is tested in the decision block 107, which limits the progress through the set of competing word sequences k≠k_(n), whether any further competing word sequences k≠k_(n) are present. If this is the case, the control switches to block 108, in which the counter variable k is increased by 1: k=k+1, whereupon the control goes to block 106 again. If not, the control goes to decision block 109, which limits the progress through the training set of spoken utterances, for which purpose it is tested whether any further training utterances are available. If this is the case, the counter variable n is increased by 1: n=n+1, in block 110 and the control returns to block 104 again. If not, the progress through the training set of spoken utterances is ended and the control is moved to block 111.

[0060] In block 111, the new values of the parameters Λ are calculated, i.e. in the first iteration step I=1 the values Λ⁽¹⁾. In the subsequent decision block 112, a stop criterion is applied so as to ascertain whether the optimization has sufficiently converged. Various methods are known for this. For example, it may be required that the relative changes of the parameters or those of the target functions should fall below a given threshold. In any case, however, the iteration may be broken off after a given maximum number of iteration steps.

[0061] If the iteration has not yet sufficiently converged, the iteration counter variable I is increased by 1 in block 113: I=I+1, whereupon in block 103 the iteration loop is entered again. In the opposite case, the iteration is concluded with general rearrangement measures in block 114.
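The control flow of the flowchart may be summarized in the following Python sketch. The function `train`, its arguments and the toy pass used in the example are hypothetical: the pass through the training set (blocks 103 to 111) is represented by a callable supplied from outside, the start scores are assumed to be strictly positive, and the stop criterion shown is the relative change of the parameters combined with a maximum number of iteration steps.

```python
import math

def train(p_start, run_pass, tol=1e-4, max_iter=50):
    """Outer iteration loop of the flowchart.

    p_start:  {variant id: start score p(v_lj | w_l)}, all strictly positive
    run_pass: callable mapping the current parameters to the new parameters
              (one pass through the training set, blocks 103 to 111)
    """
    # Block 102: start values lambda^(0) as logarithms of the start scores.
    lam = {key: math.log(p) for key, p in p_start.items()}
    iterations = 0
    while iterations < max_iter:
        new_lam = run_pass(lam)  # blocks 103 to 111
        iterations += 1          # block 113
        change = max(
            abs(new_lam[k] - lam[k]) / max(abs(lam[k]), 1e-12) for k in lam
        )
        lam = new_lam
        if change < tol:         # block 112: stop criterion
            break
    return lam, iterations

# Toy pass (assumption): move every parameter halfway towards a fixed target.
target = {"a": -0.2, "b": -1.5}
lam, iterations = train(
    {"a": 0.5, "b": 0.5},
    lambda cur: {k: 0.5 * (v + target[k]) for k, v in cur.items()},
)
```

The toy pass merely makes the loop terminate in a predictable way; in an actual training run the callable would carry out the accumulation over utterances and competing word sequences.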

[0062] A special iterative optimization process was described in detail above for determining the parameters λ_(lj), but it will be clear to those skilled in the art that other optimization methods may alternatively be used. In particular, all methods known in connection with discriminative model combination are applicable. Special mention is made here again of the methods disclosed in WO 99/31654. This describes in particular also a method which renders it possible to determine the parameters non-iteratively in a closed form. The parameter vector Λ is then obtained through solving of a linear equation system having the form Λ=Q⁻¹P, wherein the matrix Q and the vector P result from score correlations and the target function. The reader is referred to WO 99/31654 for further details.

[0063] After the parameters λ_(lj) have been determined, they can be used for selecting the pronunciation variants v_(lj) also included in the pronunciation lexicon. Thus, for example, variants v_(lj) with scores p(v_(lj)|w_(l)), which are below a given threshold value, may be removed from the pronunciation lexicon. Furthermore, a pronunciation lexicon may be created with a given number of variants v_(lj) in that a suitable number of variants v_(lj) having the lowest scores p(v_(lj)|w_(l)) are deleted.
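Both ways of thinning out the pronunciation lexicon may be sketched in Python as follows. The function names, the key layout and the parameter values are assumptions for illustration; the scores are recovered from the parameters through the exponential relationship of equation (6).

```python
import math

def prune_by_threshold(lam, threshold):
    """Keep only variants whose score p = exp(lambda) reaches the threshold."""
    return {key: value for key, value in lam.items() if math.exp(value) >= threshold}

def keep_best(lam, num_variants):
    """Keep the given number of variants with the highest scores."""
    ranked = sorted(lam.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:num_variants])

# Hypothetical trained parameters, keyed by (word, variant).
lam = {
    ("can", "k ae n"): math.log(0.75),
    ("can", "k n"): math.log(0.2),
    ("can", "k ah n"): math.log(0.05),
}
above_threshold = prune_by_threshold(lam, 0.1)
best_one = keep_best(lam, 1)
```

Since the exponential function is monotonic, ranking by λ_(lj) and ranking by the score p(v_(lj)|w_(l)) give the same order.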

1. A method of training parameters of a speech recognition system, each parameter being associated with exactly one pronunciation variant of a word from a vocabulary, which method comprises the steps of: making available a training set of acoustical speech signals, and determining the parameters through discriminative optimization of a target function.
2. A method as claimed in claim 1, characterized in that the parameter λ_(lj) associated with the j^(th) pronunciation variant v_(lj) of the l^(th) word w_(l) from the vocabulary has the following exponential relationship with a score p(v_(lj)|w_(l)), such that the word w_(l) is pronounced as the pronunciation variant v_(lj): p(v_(lj)|w_(l))=e^(λ_(lj))
3. A method as claimed in claim 1 or 2, characterized in that the target function is calculated as a continuous function, which is capable of differentiation, of the following quantities: the respective Levenshtein distances Γ(k_(n),k) between a spoken word sequence k_(n) associated with a corresponding acoustical speech signal x_(n) from the training set and further word sequences k≠k_(n) associated with the speech signal and competing with k_(n), and respective scores p_(Λ)(k|x_(n)) and p_(Λ)(k_(n)|x_(n)) indicating how well the further word sequences k≠k_(n) and the spoken word sequence k_(n) match the speech signal x_(n).
4. A method as claimed in any one of the claims 1 to 3, characterized in that a probability model is used as said respective score p(v_(lj)|w_(l)), representing the probability that the word w_(l) is pronounced as the pronunciation variant v_(lj) and a probability model is used as said respective score p_(Λ)(k_(n)|x_(n)), representing the probability that the spoken word sequence k_(n) associated with the corresponding acoustical speech signal x_(n) from the training set is spoken as the speech signal x_(n), and/or a probability model is used as said respective score p_(Λ)(k|x_(n)), representing the probability that the relevant competing word sequence k≠k_(n) is spoken as the speech signal x_(n).
5. A method as claimed in any one of the claims 1 or 4, characterized in that the discriminative optimization of the target function is carried out by one of the methods of discriminative model combination.
6. A system for the training of parameters of a speech recognition system, each parameter being associated with exactly one pronunciation variant of a word from a vocabulary, which system is designed for: making available a training set of acoustical speech signals, and determining the parameters through discriminative optimization of a target function.
7. A method of training parameters of a pattern recognition system, each parameter being associated with exactly one realization variant of a pattern from an inventory, which method comprises the steps of: making available a training set of patterns, and determining the parameters through discriminative optimization of a target function.
8. A system for the training of parameters of a pattern recognition system, each parameter being associated with exactly one realization variant of a pattern from an inventory, which system is designed for: making available a training set of patterns, and determining the parameters through discriminative optimization of a target function.
9. Parameters of a pattern recognition system which are each associated with exactly one realization variant of a pattern from an inventory and which were generated by means of a method as claimed in claim 7.
10. A data carrier with parameters of a pattern recognition system as claimed in claim 9.